1 Goal

This is a bachelor thesis to re-analyse mass cytometry data using two new R packages: CytoGLMM and cytoeffect on a dataset with a larger number of donors. The goal is to replicate biological conclusions with other software and uncover new biological findings.

2 Prerequisites

Install packages.

pkgs_needed = c("devtools","tidyverse","magrittr","FlowRepositoryR",
                "flowCore","openCyto","scales","parallel",
                "RColorBrewer","ggcorrplot","SummarizedExperiment",
                "lme4","lmerTest")
letsinstall = setdiff(pkgs_needed, installed.packages())
if (length(letsinstall) > 0) {
  BiocManager::install(letsinstall)
}
# package is still private
devtools::install_github("ChristofSeiler/CytoGLMM")
# Bioconductor version breaks when updating to ggplot2 v3.0
devtools::install_github("RGLab/ggcyto", ref="trunk")

Load packages.

library("CytoGLMM")
## Warning: replacing previous import 'MASS::select' by 'dplyr::select' when
## loading 'CytoGLMM'
## Warning: replacing previous import 'cowplot::ggsave' by 'ggplot2::ggsave'
## when loading 'CytoGLMM'
## Warning: replacing previous import 'stringr::boundary' by
## 'strucchange::boundary' when loading 'CytoGLMM'
library("tidyverse")
library("magrittr")
library("FlowRepositoryR")
library("flowCore")
library("openCyto")
library("ggcyto")
library("scales")
library("parallel")
library("RColorBrewer")
library("ggcorrplot")
library("SummarizedExperiment")
library("lme4")
library("lmerTest")

3 Introduction

3.1 Mass cytometry

Mass cytometry (MC) is a method that examines multiple parameters in cells at a single-cellular level in a high throughput manner. It is derived from flow cytometry and mass spectrometry. It can measure up to 40 parameters per cell, usually proteins. Mass cytometry focuses on proteins, but can also measure posttranslational modifications, products from proteolysis and in some rare cases RNA (Frei et al., 2016). Traditionally, cells were analyzed in bulk, but new technologies such as MC allow for the identification of each cell and therefore much more detailed information about the heterogeneity of cells (Spitzer & Nolan, 2015). The typical workflow of an MC experiment involves cells being labeled with antibodies that have been previously chelated to heavy metal stable isotopes, such as lanthanides. They are then fed into a chamber with argon gas that turns the cells into a gaseous and ionized form. A quadrupole then selects for the particles labeled with heavy metal isotopes, and feeds them into a time-of-flight mass spectrometer that accelerates the particles with the help of an electrical field. At the end, a detector measures the mass-to-charge ratio of the particles allowing their identification. Mass cytometry has a wide range of applications, and has mainly been used in immunology, infectious diseases, oncology and drug development (Di Palma & Bodenmiller, 2015; Spitzer & Nolan, 2016).

3.2 Differential expression analysis: ‘CytoGLMM’ & ‘cytoeffect’

Mass cytometry can process huge amounts of cells which raises issues in processing such high dimensional data. To comprehend and visualize this vast amount of data, researchers often use summary statistics, thereby losing valuable information in these multivariate datasets. Christof Seiler and his team at Stanford (2019) have developed two new R packages to address this issue, and explore data in an uncompressed form thereby improving statistical power and replicability. When analyzing mass cytometry data, there are two main steps: identifying the cells in a process called gating and then comparing different expression levels across cell types and experiment conditions. This differential expression is measured by performing regression, the R packages being used in this paper for this are CytoGLMM and Cytoeffect. With the use of two statistical mixed models, they explicitly look at marker correlations which allows to account explicitly for cell-to-cell and donor-to-donor variability. One advantage of this multivariate method is to highlight possible different confounders in the data. The two models used are the multivariate Poisson log-normal model and the logistic mixed effect model which give different conclusions and used together can give more insight into the data.

4 Data: HIV vaccination

4.1 Dataset characteristics

The dataset being reanalyzed is studying the differences in macaque natural killer (NK) cell protein expression between a first and second vaccination (Palgen et al., 2019). Four macaques (BB078, BB231, BC641, BD620) were vaccinated twice: a prime and a boost injection. Whole blood levels were measured at numerous time points: before vaccination, after the first vaccination at six timepoints, and after the boost four times. There is a two month interval between the two vaccinations. They received the ANRS MVA HIV-B vaccine, encoding HIV-Gag, Pol, and Nef proteins. Not used in this analysis, a second group of animals received a subcutaneous injection with a buffer liquid to control that the reactions observed with MVA were specific to that vaccine. The blood was processed, stained, and acquired with a CyTOF (Fluidigm). The data was bead normalized (Finck et al., 2013) and gated using the software SPADE. NK cell phenotypic families were then determined using hierarchical clustering. To classify immune profiles distinguishing post-prime and post-boost NK cells, least absolute shrinkage and selection operator (LASSO) and linear discriminant analysis (LDA) were performed. The data was released normalized and gated on Flow Repository. In this analysis, a subset of the data has been chosen for simplicity, two time points were chosen: six hours after the prime vaccination (PPH6) and six hours after the boost vaccination (PBH6). The dataset was chosen as it was released gated. Most studies have very difficult to reproduce gating mechanisms.

4.2 Download data

The data can be accessed with the package FlowRepositoryR to download fcs files from FlowRepository. Only download files 6 hours after prime vaccination (PPD000H06) and 6 hours after boost vaccination (PBD000H06).

4.3 Data formatting - sample table

After downloading the data, a sample table is created by parsing the fcs filenames. A local folder was created “Subselection H6PP&PB”.

fcs_filesVA1 = list.files(path = "Subselection H6PP&PB", pattern = "fcs")
map_time = function(x) {
  if (str_detect(x, "PPD000H06")) "Post-prime" 
  else if (str_detect(x, "PBD000H06")) "Post-boost"
  else NA
}

sample_tableVA1 = tibble(
  donor = str_extract(fcs_filesVA1, "_B[B-D]{1}..."),
  term = sapply(fcs_filesVA1, map_time) %>% as.factor,
  file_name = paste0("Subselection H6PP&PB/",fcs_filesVA1)
)
sample_tableVA1$donor = gsub("_", "", sample_tableVA1$donor)

sample_tableVA1
## # A tibble: 8 x 3
##   donor term       file_name                               
##   <chr> <fct>      <chr>                                   
## 1 BB078 Post-boost Subselection H6PP&PB/PBD000H06_BB078.fcs
## 2 BB231 Post-boost Subselection H6PP&PB/PBD000H06_BB231.fcs
## 3 BC641 Post-boost Subselection H6PP&PB/PBD000H06_BC641.fcs
## 4 BD620 Post-boost Subselection H6PP&PB/PBD000H06_BD620.fcs
## 5 BB078 Post-prime Subselection H6PP&PB/PPD000H06_BB078.fcs
## 6 BB231 Post-prime Subselection H6PP&PB/PPD000H06_BB231.fcs
## 7 BC641 Post-prime Subselection H6PP&PB/PPD000H06_BC641.fcs
## 8 BD620 Post-prime Subselection H6PP&PB/PPD000H06_BD620.fcs

4.4 Marker table

A marker table is created by extracting the marker isotopes and protein names. A third column is added to determine whether the markers were used for identifying cells (phenotype), are functional proteins, or were unused. Palgen et al. (2019) state in their paper that due to low reactivity multiple markers needed to be excluded from the analysis.

# names of markers used for gating extracted from the paper
map_type = function(x) {
  if (str_detect(x, paste(c("CD66", "HLADR", "CD3", "CD107a", "CD8", "CD45", "GranzymeB", "CD56", "CD62L", "CD4", "CD11a", "CD2", "CD7", "NKG2D", "CD11c", "CD69", "CD25", "CD16", "CCR5", "CXCR4", "CD14", "perforine", "NKG2A", "CD20", "CCR7"),collapse = '|'))) "phenotype"
  else if (str_detect(x, paste(c("Di", "Time", "Cell_length", "cells", "File Number"),collapse = '|'))) "unused"
  else "function"
}

#Creating marker table
y <- sample_tableVA1$file_name[1]
fcs = read.FCS(y, transformation = FALSE)

isotope = pData(parameters(fcs))[,"name"]
protein_name = pData(parameters(fcs))[,"desc"]
type = sapply(protein_name, map_type)
tb_marker = tibble(isotope, protein_name, type)
tb_marker
## # A tibble: 45 x 3
##    isotope     protein_name type     
##    <I<chr>>    <I<chr>>     <chr>    
##  1 Time        Time         unused   
##  2 Cell_length Cell_length  unused   
##  3 (Rh103)Di   (Rh103)Di    unused   
##  4 (Xe131)Di   (Xe131)Di    unused   
##  5 (Cs133)Di   (Cs133)Di    unused   
##  6 (La139)Di   (La139)Di    unused   
##  7 (Ce140)Di   (Ce140)Di    unused   
##  8 (Pr141)Di   CD66         phenotype
##  9 (Nd142)Di   HLADR        phenotype
## 10 (Nd143)Di   CD3          phenotype
## # … with 35 more rows
#deleting before last two rows, 43 & 44 as their protein_name was 'cells' 
tb_marker <- tb_marker[-c(43,44), ]
tb_marker
## # A tibble: 43 x 3
##    isotope     protein_name type     
##    <I<chr>>    <I<chr>>     <chr>    
##  1 Time        Time         unused   
##  2 Cell_length Cell_length  unused   
##  3 (Rh103)Di   (Rh103)Di    unused   
##  4 (Xe131)Di   (Xe131)Di    unused   
##  5 (Cs133)Di   (Cs133)Di    unused   
##  6 (La139)Di   (La139)Di    unused   
##  7 (Ce140)Di   (Ce140)Di    unused   
##  8 (Pr141)Di   CD66         phenotype
##  9 (Nd142)Di   HLADR        phenotype
## 10 (Nd143)Di   CD3          phenotype
## # … with 33 more rows

As claimed by the authors (Palgen et al., 2019) the data has been normalized and gated. The cells contained are assumed to be natural killer (NK) cells as the paper only focuses on NK cells.

4.4.1 Combine the sample and marker dataset

# setting computational resources
ncores = parallel::detectCores()
# load data
fset = read.ncdfFlowSet(sample_tableVA1$file_name, mc.cores = ncores) 
pData(fset) = cbind(pData(fset),sample_tableVA1)
df_samples = lapply(seq(fset), function(sample_id) {
    marker_ids = which(fset@colnames %in% tb_marker$isotope)
    exprs = as_tibble(exprs(fset[[sample_id]]))[,marker_ids]
    file_name = pData(fset[sample_id])$file_name
    exprs %>% add_column(file_name)
  }) %>% bind_rows
str(df_samples)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1630186 obs. of  44 variables:
##  $ Time       : num  0 1 18 20 28 38 38 43 49 52 ...
##  $ Cell_length: num  18 16 22 25 25 16 24 29 30 19 ...
##  $ (Rh103)Di  : num  -0.62596 -0.9804 -0.00455 -0.93974 -0.13917 ...
##  $ (Xe131)Di  : num  -0.939 -0.431 -0.332 -0.227 -0.608 ...
##  $ (Cs133)Di  : num  -0.672 -0.042 -0.305 -0.618 -0.771 ...
##  $ (La139)Di  : num  -0.753 0.737 2.013 0.66 -0.521 ...
##  $ (Ce140)Di  : num  -0.4145 0.1893 -0.4858 1.6834 0.0661 ...
##  $ (Pr141)Di  : num  321 214 285 190 365 ...
##  $ (Nd142)Di  : num  1.866 7.979 1.981 0.399 2.454 ...
##  $ (Nd143)Di  : num  1.655 0.784 2.231 1.159 2.209 ...
##  $ (Nd144)Di  : num  1.34 1.67 8.47 10.92 4.79 ...
##  $ (Nd145)Di  : num  0.488 0.257 -0.128 1.658 2.641 ...
##  $ (Nd146)Di  : num  20.7 24.7 10.8 42.3 10.2 ...
##  $ (Sm147)Di  : num  -0.175 0.562 -0.799 -0.066 1.991 ...
##  $ (Nd148)Di  : num  1.6758 -0.5441 -0.0705 -0.3159 0.4077 ...
##  $ (Sm149)Di  : num  2.94 5.12 4.23 1.48 7.37 ...
##  $ (Nd150)Di  : num  3.76 15.1 19.23 14.93 29.95 ...
##  $ (Eu151)Di  : num  -0.686 -0.727 -0.937 -0.989 -0.89 ...
##  $ (Sm152)Di  : num  3.321 -0.863 1.287 1.866 0.873 ...
##  $ (Eu153)Di  : num  2.11 1.89 7.74 5.42 4.5 ...
##  $ (Sm154)Di  : num  0.444 3.78 -0.449 0.998 2.422 ...
##  $ (Gd155)Di  : num  0.505 -0.239 1.003 1.893 0.546 ...
##  $ (Gd156)Di  : num  -0.701 -0.508 -0.315 -0.684 -0.683 ...
##  $ (Gd158)Di  : num  -0.746 -0.532 0.515 -0.311 0.287 ...
##  $ (Tb159)Di  : num  -0.038 1.3318 -0.2569 -0.1783 -0.0139 ...
##  $ (Gd160)Di  : num  -0.945 -0.0115 0.6834 -0.7285 0.9801 ...
##  $ (Dy161)Di  : num  2.109 0.652 4.146 0.787 5.814 ...
##  $ (Dy162)Di  : num  3.697 0.785 2.09 1.179 1.723 ...
##  $ (Dy163)Di  : num  -0.3061 0.5433 1.6583 -0.0153 -0.9743 ...
##  $ (Dy164)Di  : num  2.1 2.54 3.09 3.85 5.95 ...
##  $ (Ho165)Di  : num  -0.837 1.202 5.915 6.842 13.71 ...
##  $ (Er166)Di  : num  11.88 6.3 8.78 7.63 14.38 ...
##  $ (Er167)Di  : num  -0.8443 -0.0757 1.0502 -0.3283 1.0173 ...
##  $ (Er168)Di  : num  0.734 1.001 3.536 1.328 2.746 ...
##  $ (Tm169)Di  : num  2.17 4.95 17.41 6.12 19.2 ...
##  $ (Er170)Di  : num  2.89 10.98 9.75 3.96 1.56 ...
##  $ (Yb171)Di  : num  7.4 4.23 1.25 3.45 14.83 ...
##  $ (Yb172)Di  : num  2.217 -0.607 -0.621 2.97 0.429 ...
##  $ (Yb173)Di  : num  0.1112 0.0971 -0.7792 -0.6985 -0.9571 ...
##  $ (Yb174)Di  : num  8.02 2.37 6.65 4.49 4.31 ...
##  $ (Lu175)Di  : num  16.18 6.51 9.07 10.41 11.11 ...
##  $ (Lu176)Di  : num  1.443 1.317 3.223 8.252 0.216 ...
##  $ FileNum    : num  1.067 1.007 1.059 0.973 0.938 ...
##  $ file_name  : chr  "Subselection H6PP&PB/PBD000H06_BB078.fcs" "Subselection H6PP&PB/PBD000H06_BB078.fcs" "Subselection H6PP&PB/PBD000H06_BB078.fcs" "Subselection H6PP&PB/PBD000H06_BB078.fcs" ...

Re-naming columns

df_samples %<>% inner_join(sample_tableVA1,by = "file_name")
oldnames = tb_marker$isotope
newnames = tb_marker$protein_name
df_samples %<>% rename_at(vars(oldnames), ~ newnames)
str(df_samples)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1630186 obs. of  46 variables:
##  $ Time       : num  0 1 18 20 28 38 38 43 49 52 ...
##  $ Cell_length: num  18 16 22 25 25 16 24 29 30 19 ...
##  $ (Rh103)Di  : num  -0.62596 -0.9804 -0.00455 -0.93974 -0.13917 ...
##  $ (Xe131)Di  : num  -0.939 -0.431 -0.332 -0.227 -0.608 ...
##  $ (Cs133)Di  : num  -0.672 -0.042 -0.305 -0.618 -0.771 ...
##  $ (La139)Di  : num  -0.753 0.737 2.013 0.66 -0.521 ...
##  $ (Ce140)Di  : num  -0.4145 0.1893 -0.4858 1.6834 0.0661 ...
##  $ CD66       : num  321 214 285 190 365 ...
##  $ HLADR      : num  1.866 7.979 1.981 0.399 2.454 ...
##  $ CD3        : num  1.655 0.784 2.231 1.159 2.209 ...
##  $ CD107a     : num  1.34 1.67 8.47 10.92 4.79 ...
##  $ CD8        : num  0.488 0.257 -0.128 1.658 2.641 ...
##  $ CD45       : num  20.7 24.7 10.8 42.3 10.2 ...
##  $ IL4        : num  -0.175 0.562 -0.799 -0.066 1.991 ...
##  $ GranzymeB  : num  1.6758 -0.5441 -0.0705 -0.3159 0.4077 ...
##  $ CD56       : num  2.94 5.12 4.23 1.48 7.37 ...
##  $ CD62L      : num  3.76 15.1 19.23 14.93 29.95 ...
##  $ (Eu151)Di  : num  -0.686 -0.727 -0.937 -0.989 -0.89 ...
##  $ CD4        : num  3.321 -0.863 1.287 1.866 0.873 ...
##  $ CD11a      : num  2.11 1.89 7.74 5.42 4.5 ...
##  $ CD2        : num  0.444 3.78 -0.449 0.998 2.422 ...
##  $ CD7        : num  0.505 -0.239 1.003 1.893 0.546 ...
##  $ MIP1B      : num  -0.701 -0.508 -0.315 -0.684 -0.683 ...
##  $ (Gd158)Di  : num  -0.746 -0.532 0.515 -0.311 0.287 ...
##  $ TNFa       : num  -0.038 1.3318 -0.2569 -0.1783 -0.0139 ...
##  $ Ki67       : num  -0.945 -0.0115 0.6834 -0.7285 0.9801 ...
##  $ NKG2D      : num  2.109 0.652 4.146 0.787 5.814 ...
##  $ CD11c      : num  3.697 0.785 2.09 1.179 1.723 ...
##  $ (Dy163)Di  : num  -0.3061 0.5433 1.6583 -0.0153 -0.9743 ...
##  $ CD69       : num  2.1 2.54 3.09 3.85 5.95 ...
##  $ IFNg       : num  -0.837 1.202 5.915 6.842 13.71 ...
##  $ CD25       : num  11.88 6.3 8.78 7.63 14.38 ...
##  $ CD16       : num  -0.8443 -0.0757 1.0502 -0.3283 1.0173 ...
##  $ CCR5       : num  0.734 1.001 3.536 1.328 2.746 ...
##  $ CXCR4      : num  2.17 4.95 17.41 6.12 19.2 ...
##  $ CD14       : num  2.89 10.98 9.75 3.96 1.56 ...
##  $ perforine  : num  7.4 4.23 1.25 3.45 14.83 ...
##  $ NKG2A      : num  2.217 -0.607 -0.621 2.97 0.429 ...
##  $ (Yb173)Di  : num  0.1112 0.0971 -0.7792 -0.6985 -0.9571 ...
##  $ CD20       : num  8.02 2.37 6.65 4.49 4.31 ...
##  $ CCR7       : num  16.18 6.51 9.07 10.41 11.11 ...
##  $ IL10       : num  1.443 1.317 3.223 8.252 0.216 ...
##  $ File Number: num  1.067 1.007 1.059 0.973 0.938 ...
##  $ file_name  : chr  "Subselection H6PP&PB/PBD000H06_BB078.fcs" "Subselection H6PP&PB/PBD000H06_BB078.fcs" "Subselection H6PP&PB/PBD000H06_BB078.fcs" "Subselection H6PP&PB/PBD000H06_BB078.fcs" ...
##  $ donor      : chr  "BB078" "BB078" "BB078" "BB078" ...
##  $ term       : Factor w/ 2 levels "Post-boost","Post-prime": 1 1 1 1 1 1 1 1 1 1 ...
#factor
df_samples$term %<>% factor(levels = c("Post-prime",
                                           "Post-boost"))

Cell counts listed per donor and condition

table(df_samples$donor,df_samples$term)
##        
##         Post-prime Post-boost
##   BB078     163789     206027
##   BB231     228105     330010
##   BC641     172985     264225
##   BD620      93624     171421

In a first time, we will focus on the functional proteins. In a second step, a larger group of proteins could be looked at. The original authors used all 31 markers (function & phenotype), excluding unused proteins.

#1. only function proteins
protein_names_func = tb_marker %>% 
  dplyr::filter(type == "function") %>%
  .$protein_name
protein_names_func
## [1] "IL4"   "MIP1B" "TNFa"  "Ki67"  "IFNg"  "IL10"
#2. all used proteins
protein_names_all = tb_marker %>% 
  dplyr::filter(type != "unused") %>%
  .$protein_name
protein_names_all
##  [1] "CD66"      "HLADR"     "CD3"       "CD107a"    "CD8"      
##  [6] "CD45"      "IL4"       "GranzymeB" "CD56"      "CD62L"    
## [11] "CD4"       "CD11a"     "CD2"       "CD7"       "MIP1B"    
## [16] "TNFa"      "Ki67"      "NKG2D"     "CD11c"     "CD69"     
## [21] "IFNg"      "CD25"      "CD16"      "CCR5"      "CXCR4"    
## [26] "CD14"      "perforine" "NKG2A"     "CD20"      "CCR7"     
## [31] "IL10"

Declare the columns in df_samples that are not protein markers. In our example, we have donor ID (animal), time point when the sample was collected (before or after the boost vaccination), FCS filename.

sample_info_names = c(names(sample_tableVA1))
sample_info_names
## [1] "donor"     "term"      "file_name"

5 Data Exploration

5.1 Transform Data

The arcsin (x/5) transformation is applied as it models noise non-uniformly, corrects for negative values and normally distributed cell types. More in methodology section of thesis in the section about logistic regression in the generalised linear models part.

trans_func = function(x) asinh(x/5)
#with functional markers
df_samples_tfm = df_samples %>% mutate_at(protein_names_func, trans_func)

#with all used markers
df_samples_all = df_samples %>% mutate_at(protein_names_all, trans_func)

#factor the term so that post-prime comes first as it chronologically does
df_samples_tfm$term = factor(df_samples_tfm$term, levels = c("Post-prime", "Post-boost"))

5.2 Multidimensional scaling (MDS)

MDS on median marker expression of NK cells.

CytoGLMM::plot_mds(df_samples_tfm,
                   protein_names = protein_names_func,
                   sample_info_names = sample_info_names,
                   color = "term")

5.3 Heatmap

Heatmap of median marker expression of NK cells for functional proteins.

CytoGLMM::plot_heatmap(df_samples_tfm,
                       protein_names = protein_names_func,
                       sample_info_names = sample_info_names,
                       arrange_by_1 = "term")

There is no distinctive difference in protein concentration between post-prime and post-boost cases. TNFq, Ki67, IL4, and MIP1B are not present in high concentrations. IFNg and IL10 have higher quantities at both time points.

Heatmap of median marker expression of NK cells for all used proteins.

CytoGLMM::plot_heatmap(df_samples_all,
                       protein_names = protein_names_all,
                       sample_info_names = sample_info_names,
                       arrange_by_1 = "term")

CytoGLMM::plot_heatmap(df_samples_all,
                       protein_names = protein_names_all [which(protein_names_all != "CD66")],
                       sample_info_names = sample_info_names,
                       arrange_by_1 = "term")

Interesting to look at all protein markers, does not seem to be a strong visual difference between two groups. CD66 is very highly present, goes up to above 4 on scale. When CD66 is removed, then there are still very high amounts of CD45 and perforine. There is markedly more CD66 and CD45 in post-boost in comparison to post-prime. CD11a is also more present in post-boost in comparison to the post-prime condition.

5.4 PCA

PCA plot of NK cells.

CytoGLMM::plot_prcomp(df_samples_tfm,
                      protein_names = protein_names_func,
                      color_var = "term",
                      repel = TRUE)

5.5 Density Plots

Density plots of the marker IFNg, CD16, CD66, CD45, and TNFa for all donors.

ggplot(df_samples_tfm, aes_string(x = "IFNg", color = "term")) + 
  geom_density() + 
  facet_wrap(~donor)

ggplot(df_samples_tfm, aes_string(x = "CD16", color = "term")) + 
  geom_density() + 
  facet_wrap(~donor)

ggplot(df_samples_all, aes_string(x = "CD66", color = "term")) + 
  geom_density() + 
  facet_wrap(~donor)

ggplot(df_samples_all, aes_string(x = "CD45", color = "term")) + 
  geom_density() + 
  facet_wrap(~donor)

ggplot(df_samples_all, aes_string(x = "TNFa", color = "term")) + 
  geom_density() + 
  facet_wrap(~donor)

From the plots about IFNg, it can be seen that in most animals, there is no strong difference in IFNg. CD16 is incredily low or absent in all animals in both conditions. CD66 demonstrates a clear shift in the boost condition, with more CD66 present and in higher density. CD45 shows a similar shift to CD66 although not quite as large a difference. TNFa is present in very low quantities in all conditions.

5.6 Two-Dimensional Histograms

Two-dimensional histograms for plotting IFNg and IL10, two functional markers for all donors. Furthermore, histograms are created for CD3 and CD8, CD16 and CD56, and CD66 and CD56.

colorscale = scale_fill_gradientn(
  colors = rev(brewer.pal(9, "YlGnBu")), 
  values = c(0, exp(seq(-5, 0, length.out = 100)))
  )
ggplot(df_samples_tfm, aes_string(x = "IFNg", y = "IL10")) + 
  geom_hex(bins = 64) +
  colorscale + 
  coord_fixed() +
  facet_wrap(~donor)

ggplot(df_samples_all, aes_string(x = "CD3", y = "CD8")) + 
  geom_hex(bins = 64) +
  colorscale + 
  coord_fixed() 

ggplot(df_samples_all, aes_string(x = "CD16", y = "CD56")) + 
  geom_hex(bins = 64) +
  colorscale + 
  coord_fixed() +
  facet_wrap(~donor)

ggplot(df_samples_all, aes_string(x = "CD66", y = "CD56")) + 
  geom_hex(bins = 64) +
  colorscale + 
  coord_fixed() +
  facet_wrap(~donor)

Il-10 - IFNg plot shows that generally both markers are low. CD3 - CD8 plot is not confirming the gating as they should be CD3 negative and CD8 positive. There seems to be a wide variety and a skew in the highest amount of cells being more CD3 positive and CD8 negative, rather then the other way around. Concerning the CD56 and CD16, it looks like a predominance of CD56 bright CD16 dim cells. All animals seem to have similar histograms.

5.6.1 Two-dimensional histograms for group comparisons.

ggplot(df_samples_tfm, aes_string(x = "IFNg", y = "IL10")) + 
  geom_hex(bins = 64) +
  colorscale + 
  coord_fixed() +
  facet_wrap(~term)

ggplot(df_samples_all, aes_string(x = "CD66", y = "CD56")) + 
  geom_hex(bins = 64) +
  colorscale + 
  coord_fixed() +
  facet_wrap(~term)

ggplot(df_samples_all, aes_string(x = "CD16", y = "CD56")) + 
  geom_hex(bins = 64) +
  colorscale + 
  coord_fixed() +
  facet_wrap(~term)

With CD56 and CD66 a clear shift is visible to more CD66 positive NK cells in the boost condition, both CD56 positive and negative. A general right shift is seen.

Amount of CD16 seems not to change much between two conditions except for a few outliers. There is a shift of the bulk having less CD56 but tail end of CD56 is slightly denser.

5.6.2 NK cell count. List the smallest and largest.

df_samples_tfm %>% group_by(term,donor) %>% tally %>% arrange(n)
## # A tibble: 8 x 3
## # Groups:   term [2]
##   term       donor      n
##   <fct>      <chr>  <int>
## 1 Post-prime BD620  93624
## 2 Post-prime BB078 163789
## 3 Post-boost BD620 171421
## 4 Post-prime BC641 172985
## 5 Post-boost BB078 206027
## 6 Post-prime BB231 228105
## 7 Post-boost BC641 264225
## 8 Post-boost BB231 330010
df_samples_tfm %>% group_by(term,donor) %>% tally %>% arrange(desc(n))
## # A tibble: 8 x 3
## # Groups:   term [2]
##   term       donor      n
##   <fct>      <chr>  <int>
## 1 Post-boost BB231 330010
## 2 Post-boost BC641 264225
## 3 Post-prime BB231 228105
## 4 Post-boost BB078 206027
## 5 Post-prime BC641 172985
## 6 Post-boost BD620 171421
## 7 Post-prime BB078 163789
## 8 Post-prime BD620  93624

5.7 Marker Correlations

Plot marker correlations for functional proteins.

mcor = cor(df_samples_tfm %>% dplyr::select(protein_names_func))
ggcorrplot(mcor, hc.order = TRUE, type = "lower", 
           outline.col = "lightgray",
           colors = c("#6D9EC1", "white", "#E46726"))+
  theme(axis.text.x  = element_text(angle = 90, vjust=0))

Functional markers are not highly correlated with each other.

Plot marker correlations for all used proteins.

mcor = cor(df_samples_all %>% dplyr::select(protein_names_all))
ggcorrplot(mcor, hc.order = TRUE, type = "lower", 
           outline.col = "lightgray",
           colors = c("#6D9EC1", "white", "#E46726"))+
  theme(axis.text.x  = element_text(angle = 90, vjust=0))

There are a few phenotype markers that are slightly positively correlated such as CCR5 and CD107a.

6 Regression Analysis on Summarized Data

Classical differential analysis approach comparing median marker expressions.

6.1 Plot Median Marker Expression

Plot all celltypes.

#functional proteins
df_median_fct = df_samples_tfm %>%
      group_by(file_name, donor, term) %>%
      summarise_at(protein_names_func, median)
df_median_long_fct = gather(df_median_fct, protein_names_func, median_expr, 
                        -file_name, -donor, -term)

ggplot(df_median_long_fct, aes(protein_names_func, median_expr, color = term)) + 
  geom_violin() + 
  geom_jitter(width = 0.2, alpha = 0.5) + 
  facet_wrap(~ protein_names_func, nrow = 2) +
  theme(axis.text.x = element_blank()) +
  ggtitle("NK")

#all proteins
df_median_all = df_samples_all %>%
      group_by(file_name, donor, term) %>%
      summarise_at(protein_names_all, median)
df_median_long_all = gather(df_median_all, protein_names_all, median_expr, 
                        -file_name, -donor, -term)

Zoom in on marker IFNg and CD66

ggplot(df_median_fct, aes(term, IFNg, color = term)) + 
  geom_violin() + 
  geom_jitter(width = 0.2, alpha = 0.5) + 
  theme(axis.text.x = element_blank())

ggplot(df_median_all, aes(term, CD66, color = term)) + 
  geom_violin() + 
  geom_jitter(width = 0.2, alpha = 0.5) + 
  theme(axis.text.x = element_blank())

As their are so few datapoints it is hard to make conclusive statements.

6.2 Linear Mixed Model

Mixed model with median expression as response variable, experimental condition as explanatory variable, and donor as random effect. Fit separate models for each protein.

calc_pvalue = function(fit) {
  summ = summary(fit)
  coefficients(summ)["termPost-boost", "Pr(>|t|)"]
}
#functional markers
df_median_long_fct = gather(df_median_fct, protein_name, median_expr, 
                        -file_name, -donor, -term)
df_fits_fct = df_median_long_fct %>% 
  group_by(protein_name) %>%
  nest() %>% 
  mutate(fit = map(data, ~ lmer(median_expr ~ term + (1|donor), .))) %>%
  mutate(pvalue_unadj = map_dbl(fit, ~ calc_pvalue(.))) %>%
  mutate(pvalue_adj = p.adjust(pvalue_unadj, method = "BH")) %>%
  dplyr::select(protein_name, pvalue_adj)
df_fits_fct
## # A tibble: 6 x 2
##   protein_name pvalue_adj
##   <chr>             <dbl>
## 1 IL4               0.264
## 2 MIP1B             0.264
## 3 TNFa              0.973
## 4 Ki67              0.973
## 5 IFNg              0.224
## 6 IL10              0.224
df_fits_fct %>% 
  dplyr::filter(pvalue_adj < 0.05) %>% 
  print(n = Inf)
## # A tibble: 0 x 2
## # … with 2 variables: protein_name <chr>, pvalue_adj <dbl>
#all used markers
df_median_all$term %<>% factor(levels=c("Post-prime", "Post-boost"))
df_median_long_all = gather(df_median_all, protein_name, median_expr, 
                        -file_name, -donor, -term)
df_fits_all = df_median_long_all %>% 
  group_by(protein_name) %>%
  nest() %>% 
  mutate(fit = map(data, ~ lmer(median_expr ~ term + (1|donor), .))) %>%
  mutate(pvalue_unadj = map_dbl(fit, ~ calc_pvalue(.))) %>%
  mutate(pvalue_adj = p.adjust(pvalue_unadj, method = "BH")) %>%
  dplyr::select(protein_name, pvalue_adj)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.0106521
## (tol = 0.002, component 1)
df_fits_all
## # A tibble: 31 x 2
##    protein_name pvalue_adj
##    <chr>             <dbl>
##  1 CD66             0.0444
##  2 HLADR            0.0444
##  3 CD3              0.705 
##  4 CD107a           0.0882
##  5 CD8              0.499 
##  6 CD45             0.0786
##  7 IL4              0.437 
##  8 GranzymeB        0.437 
##  9 CD56             0.437 
## 10 CD62L            0.981 
## # … with 21 more rows
df_fits_all %>% 
  dplyr::filter(pvalue_adj < 0.05) %>% 
  print(n = Inf)
## # A tibble: 2 x 2
##   protein_name pvalue_adj
##   <chr>             <dbl>
## 1 CD66             0.0444
## 2 HLADR            0.0444

None of the 6 functional proteins seem to be significant. Looking at all used proteins, CD66 and HLADR are significant.

7 Regression Analysis on All The Data

For the regression analysis, we will first focus on the functional proteins. We will then look at all used markers.

8 Functional proteins

8.1 Generalized Linear Mixed Model for functional proteins

Fit a Generalized Linear Mixed Model (GLMM) with donor random effects. This function is a wrapper around the package mbest.

glmm_fit = CytoGLMM::cytoglmm(df_samples_tfm, 
                              protein_names = protein_names_func,
                              condition = "term", group = "donor")
## 2019-06-18 10:50:38 INFO:mbest.mhglm:Creating design matrix
## 2019-06-18 10:50:38 INFO:mbest.mhglm:Grouping factor and random effect design matrix
## 2019-06-18 10:50:38 INFO:mbest.mhglm:Setting weights
## 2019-06-18 10:50:38 INFO:mbest.mhglm:Setting offset
## 2019-06-18 10:50:38 INFO:mbest.mhglm:Fitting model
## 2019-06-18 10:50:38 INFO:mbest.mhglm.fit:Fitting models in parallel
## 2019-06-18 10:50:57 INFO:mbest.mhglm.fit:Computing mean and covariance estimates
## 2019-06-18 10:50:57 INFO:mbest.mhglm.fit:Estimating moments
## 2019-06-18 10:50:57 INFO:mbest.mhglm.fit:Computing mean estimate
## 2019-06-18 10:50:57 INFO:mbest.mhglm.fit:Computing covariance estimate
## 2019-06-18 10:50:57 INFO:mbest.mhglm.fit:Refining mean and covariance estimates
## 2019-06-18 10:50:57 INFO:mbest.mhglm.fit:Estimating moments
## 2019-06-18 10:50:57 INFO:mbest.mhglm.fit:Computing mean estimate
## 2019-06-18 10:50:57 INFO:mbest.mhglm.fit:Computing covariance estimate
## Warning in moment.est.cov.reducer(cov.info, diagcov, cov.rank.warn,
## cov.psd.warn): moment-based covariance matrix estimate is not positive
## semi-definite; using projection
glmm_fit
## number of cells per group and condition:       
##         Post-prime Post-boost
##   BB078     163789     206027
##   BB231     228105     330010
##   BC641     172985     264225
##   BD620      93624     171421
## 
## proteins included in the analysis:
##  IL4 MIP1B TNFa Ki67 IFNg IL10 
## 
## condition compared: term 
## grouping variable: donor
plot(glmm_fit)

summary(glmm_fit) %>% dplyr::filter(pvalues_adj < 0.05)
## # A tibble: 4 x 3
##   protein_name pvalues_unadj pvalues_adj
##   <chr>                <dbl>       <dbl>
## 1 IL10              3.96e-16    2.38e-15
## 2 IL4               9.52e- 7    2.86e- 6
## 3 IFNg              1.67e- 5    3.35e- 5
## 4 MIP1B             3.56e- 4    5.35e- 4

Add IL10, IL4, IFNg, MIP1B into one marker.

df_samples_tfm %<>% mutate(signif_sum = IL10+IL4+IFNg+MIP1B)
protein_names_sum = c(
  "signif_sum", 
  protein_names_func[!protein_names_func %in% c("IL10", "IL4", "IFNg", "MIP1B")]
)
glmm_fit = CytoGLMM::cytoglmm(df_samples_tfm, 
                              protein_names = protein_names_sum,
                              condition = "term", group = "donor")
## 2019-06-18 10:51:03 INFO:mbest.mhglm:Creating design matrix
## 2019-06-18 10:51:03 INFO:mbest.mhglm:Grouping factor and random effect design matrix
## 2019-06-18 10:51:03 INFO:mbest.mhglm:Setting weights
## 2019-06-18 10:51:03 INFO:mbest.mhglm:Setting offset
## 2019-06-18 10:51:03 INFO:mbest.mhglm:Fitting model
## 2019-06-18 10:51:04 INFO:mbest.mhglm.fit:Fitting models in parallel
## 2019-06-18 10:51:14 INFO:mbest.mhglm.fit:Computing mean and covariance estimates
## 2019-06-18 10:51:14 INFO:mbest.mhglm.fit:Estimating moments
## 2019-06-18 10:51:14 INFO:mbest.mhglm.fit:Computing mean estimate
## 2019-06-18 10:51:14 INFO:mbest.mhglm.fit:Computing covariance estimate
## 2019-06-18 10:51:14 INFO:mbest.mhglm.fit:Refining mean and covariance estimates
## 2019-06-18 10:51:14 INFO:mbest.mhglm.fit:Estimating moments
## 2019-06-18 10:51:14 INFO:mbest.mhglm.fit:Computing mean estimate
## 2019-06-18 10:51:14 INFO:mbest.mhglm.fit:Computing covariance estimate
plot(glmm_fit)

summary(glmm_fit) %>% dplyr::filter(pvalues_adj < 0.05)
## # A tibble: 1 x 3
##   protein_name pvalues_unadj pvalues_adj
##   <chr>                <dbl>       <dbl>
## 1 signif_sum        7.46e-11    2.24e-10

Take differences between IL10, IL4, IFNg, MIP1B.

df_samples_tfm %<>% mutate(IL10_minus_MIP1B = IL10-MIP1B)
df_samples_tfm %<>% mutate(IL4_minus_MIP1B = IL4-MIP1B)
df_samples_tfm %<>% mutate(IFNg_minus_MIP1B = IFNg-MIP1B)
protein_names_diff = c(
  "IL10_minus_MIP1B","IL4_minus_MIP1B","IFNg_minus_MIP1B",
  protein_names_func[!protein_names_func %in% c("IL10","IL4", "IFNg")]
)
glmm_fit = CytoGLMM::cytoglmm(df_samples_tfm, 
                              protein_names = protein_names_diff,
                              condition = "term", group = "donor")
## 2019-06-18 10:51:19 INFO:mbest.mhglm:Creating design matrix
## 2019-06-18 10:51:19 INFO:mbest.mhglm:Grouping factor and random effect design matrix
## 2019-06-18 10:51:19 INFO:mbest.mhglm:Setting weights
## 2019-06-18 10:51:19 INFO:mbest.mhglm:Setting offset
## 2019-06-18 10:51:19 INFO:mbest.mhglm:Fitting model
## 2019-06-18 10:51:19 INFO:mbest.mhglm.fit:Fitting models in parallel
## 2019-06-18 10:51:37 INFO:mbest.mhglm.fit:Computing mean and covariance estimates
## 2019-06-18 10:51:37 INFO:mbest.mhglm.fit:Estimating moments
## 2019-06-18 10:51:37 INFO:mbest.mhglm.fit:Computing mean estimate
## 2019-06-18 10:51:37 INFO:mbest.mhglm.fit:Computing covariance estimate
## 2019-06-18 10:51:37 INFO:mbest.mhglm.fit:Refining mean and covariance estimates
## 2019-06-18 10:51:37 INFO:mbest.mhglm.fit:Estimating moments
## 2019-06-18 10:51:37 INFO:mbest.mhglm.fit:Computing mean estimate
## 2019-06-18 10:51:37 INFO:mbest.mhglm.fit:Computing covariance estimate
## Warning in moment.est.cov.reducer(cov.info, diagcov, cov.rank.warn,
## cov.psd.warn): moment-based covariance matrix estimate is not positive
## semi-definite; using projection
plot(glmm_fit)

summary(glmm_fit) %>% dplyr::filter(pvalues_adj < 0.05)
## # A tibble: 4 x 3
##   protein_name     pvalues_unadj pvalues_adj
##   <chr>                    <dbl>       <dbl>
## 1 IL10_minus_MIP1B      3.96e-16    2.38e-15
## 2 MIP1B                 1.45e-11    4.36e-11
## 3 IL4_minus_MIP1B       9.52e- 7    1.90e- 6
## 4 IFNg_minus_MIP1B      1.67e- 5    2.51e- 5

Add interactions between IL10, IL4, IFNg, MIP1B.

#2
df_samples_tfm %<>% mutate(IL10_IL4 = IL10*IL4)
df_samples_tfm %<>% mutate(IL10_IFNg = IL10*IFNg)
df_samples_tfm %<>% mutate(IL10_MIP1B = IL10*MIP1B)
df_samples_tfm %<>% mutate(IL4_IFNg = IL4*IFNg)
df_samples_tfm %<>% mutate(IL4_MIP1B = IL4*MIP1B)
df_samples_tfm %<>% mutate(IFNg_MIP1B = IFNg*MIP1B)
#3
df_samples_tfm %<>% mutate(IL10_IL4_IFNg = IL10*IL4*IFNg)
df_samples_tfm %<>% mutate(IL10_IL4_MIP1B = IL10*IL4*MIP1B)
df_samples_tfm %<>% mutate(IL10_IFNg_MIP1B = IL10*IFNg*MIP1B)
df_samples_tfm %<>% mutate(IL4_IFNg_MIP1B = IL4*IFNg*MIP1B)
#4
df_samples_tfm %<>% mutate(IL10_IL4_IFNg_MIP1B = IL10*IL4*IFNg*MIP1B)
protein_names_interactions = c(protein_names_func,"IL10_IL4","IL10_IFNg", "IL10_MIP1B","IL4_IFNg", "IL4_MIP1B", "IFNg_MIP1B", "IL10_IL4_IFNg", "IL10_IL4_MIP1B", "IL10_IFNg_MIP1B", "IL4_IFNg_MIP1B", "IL10_IL4_IFNg_MIP1B")
glmm_fit = CytoGLMM::cytoglmm(df_samples_tfm, 
                              protein_names = protein_names_interactions,
                              condition = "term", group = "donor")
## 2019-06-18 10:51:44 INFO:mbest.mhglm:Creating design matrix
## 2019-06-18 10:51:44 INFO:mbest.mhglm:Grouping factor and random effect design matrix
## 2019-06-18 10:51:44 INFO:mbest.mhglm:Setting weights
## 2019-06-18 10:51:44 INFO:mbest.mhglm:Setting offset
## 2019-06-18 10:51:44 INFO:mbest.mhglm:Fitting model
## 2019-06-18 10:51:45 INFO:mbest.mhglm.fit:Fitting models in parallel
## 2019-06-18 10:53:26 INFO:mbest.mhglm.fit:Computing mean and covariance estimates
## 2019-06-18 10:53:26 INFO:mbest.mhglm.fit:Estimating moments
## 2019-06-18 10:53:26 INFO:mbest.mhglm.fit:Computing mean estimate
## 2019-06-18 10:53:26 INFO:mbest.mhglm.fit:Computing covariance estimate
## 2019-06-18 10:53:26 INFO:mbest.mhglm.fit:Refining mean and covariance estimates
## 2019-06-18 10:53:26 INFO:mbest.mhglm.fit:Estimating moments
## 2019-06-18 10:53:26 INFO:mbest.mhglm.fit:Computing mean estimate
## 2019-06-18 10:53:26 INFO:mbest.mhglm.fit:Computing covariance estimate
## Warning in moment.est.cov.reducer(cov.info, diagcov, cov.rank.warn,
## cov.psd.warn): moment-based covariance matrix estimate is not positive
## semi-definite; using projection
plot(glmm_fit)

summary(glmm_fit) %>% dplyr::filter(pvalues_adj < 0.05)
## # A tibble: 7 x 3
##   protein_name        pvalues_unadj  pvalues_adj
##   <chr>                       <dbl>        <dbl>
## 1 IL4                 0.00000000559 0.0000000951
## 2 IL10                0.0000385     0.000327    
## 3 MIP1B               0.000121      0.000613    
## 4 IL10_IL4_IFNg_MIP1B 0.000144      0.000613    
## 5 IFNg                0.000302      0.00103     
## 6 IL10_IFNg           0.00505       0.0143      
## 7 IL10_IL4_MIP1B      0.0106        0.0259

8.2 Generalized Linear Model with Bootstrap for functional proteins

Instead of modeling the donor effect, we can use bootstrap resampling. In our experience, this type of regression gives also good results when samples are not matched between conditions on the same donor.

glm_fit = CytoGLMM::cytoglm(df_samples_tfm, 
                            num_boot = 100,
                            protein_names = protein_names_func,
                            condition = "term", group = "donor",
                            cell_n_subsample = 1000, # Christof: to save memory
                            num_cores = 1 # Christof: to save memory
                            ) 
glm_fit
## 
## #######################
## ## paired analysis ####
## #######################
## 
## number of bootstrap samples: 100 
## 
## number of cells per group and condition:       
##         Post-prime Post-boost
##   BB078       1000       1000
##   BB231       1000       1000
##   BC641       1000       1000
##   BD620       1000       1000
## 
## proteins included in the analysis:
##  IL4 MIP1B TNFa Ki67 IFNg IL10 
## 
## condition compared: term 
## grouping variable: donor
plot(glm_fit)

summary(glm_fit) %>% dplyr::filter(pvalues_adj < 0.05)
## # A tibble: 4 x 3
##   protein_name pvalues_unadj pvalues_adj
##   <chr>                <dbl>       <dbl>
## 1 IFNg                  0.02        0.03
## 2 IL10                  0.02        0.03
## 3 IL4                   0.02        0.03
## 4 MIP1B                 0.02        0.03

8.3 Mixture of Regressions for functional proteins

Fit a mixture of regression model to identity clusters of donors or outliers. This function is a wrapper around the package flexmix.

num_donors = nlevels(as.factor(df_samples_tfm$donor))
mix_fit = CytoGLMM::cytoflexmix(df_samples_tfm, 
                                protein_names = protein_names_func,
                                condition = "term", group = "donor", 
                                ks = 1:num_donors,
                                cell_n_subsample = 5000, # Christof: to save memory
                                num_cores = 2 # Christof: to save memory
                                )
## 1 : * * * * *
## 2 : * * * * *
## 3 : * * * * *
## 4 : * * * * *
plot(mix_fit)

The plotting function automatically uses the BIC criterion to select the number of clusters. In this case, it picks 10 clusters.

plot_model_selection(mix_fit)

9 All used proteins

9.1 Generalized Linear Mixed Model for all used proteins

Fit a Generalized Linear Mixed Model (GLMM) with donor random effects. This function is a wrapper around the package mbest [@perry2017fast].

NOTE TO SELF: Doing it will all cells crashed the computer

glmm_fit_all = CytoGLMM::cytoglmm(df_samples_all, 
                              protein_names = protein_names_all,
                              condition = "term", group = "donor",
                              cell_n_subsample = 3000, # to save memory
                                num_cores = 1
                              )
## 2019-06-18 10:54:25 INFO:mbest.mhglm:Creating design matrix
## 2019-06-18 10:54:25 INFO:mbest.mhglm:Grouping factor and random effect design matrix
## 2019-06-18 10:54:25 INFO:mbest.mhglm:Setting weights
## 2019-06-18 10:54:25 INFO:mbest.mhglm:Setting offset
## 2019-06-18 10:54:25 INFO:mbest.mhglm:Fitting model
## 2019-06-18 10:54:25 INFO:mbest.mhglm.fit:Fitting models in parallel
## 2019-06-18 10:54:28 INFO:mbest.mhglm.fit:Computing mean and covariance estimates
## 2019-06-18 10:54:28 INFO:mbest.mhglm.fit:Estimating moments
## 2019-06-18 10:54:28 INFO:mbest.mhglm.fit:Computing mean estimate
## 2019-06-18 10:54:28 INFO:mbest.mhglm.fit:Computing covariance estimate
## 2019-06-18 10:54:29 INFO:mbest.mhglm.fit:Refining mean and covariance estimates
## 2019-06-18 10:54:29 INFO:mbest.mhglm.fit:Estimating moments
## 2019-06-18 10:54:29 INFO:mbest.mhglm.fit:Computing mean estimate
## 2019-06-18 10:54:29 INFO:mbest.mhglm.fit:Computing covariance estimate
## Warning in moment.est.cov.reducer(cov.info, diagcov, cov.rank.warn,
## cov.psd.warn): moment-based covariance matrix estimate is not positive
## semi-definite; using projection
glmm_fit_all
## number of cells per group and condition:       
##         Post-prime Post-boost
##   BB078       3000       3000
##   BB231       3000       3000
##   BC641       3000       3000
##   BD620       3000       3000
## 
## proteins included in the analysis:
##  CD66 HLADR CD3 CD107a CD8 CD45 IL4 GranzymeB CD56 CD62L CD4 CD11a CD2 CD7 MIP1B TNFa Ki67 NKG2D CD11c CD69 IFNg CD25 CD16 CCR5 CXCR4 CD14 perforine NKG2A CD20 CCR7 IL10 
## 
## condition compared: term 
## grouping variable: donor
plot(glmm_fit_all)

summary(glmm_fit_all) %>% dplyr::filter(pvalues_adj < 0.05)
## # A tibble: 16 x 3
##    protein_name pvalues_unadj pvalues_adj
##    <chr>                <dbl>       <dbl>
##  1 CD66             5.32e-282   1.65e-280
##  2 HLADR            1.56e- 40   2.42e- 39
##  3 CD107a           2.63e- 36   2.72e- 35
##  4 IFNg             1.37e- 25   1.06e- 24
##  5 CD56             3.24e- 24   2.01e- 23
##  6 CD45             3.66e- 19   1.89e- 18
##  7 CD11a            2.75e- 15   1.22e- 14
##  8 IL10             9.48e- 13   3.67e- 12
##  9 MIP1B            1.29e-  9   4.43e-  9
## 10 CD2              7.40e-  8   2.29e-  7
## 11 CD62L            5.85e-  6   1.65e-  5
## 12 IL4              1.26e-  4   3.26e-  4
## 13 CD3              2.66e-  4   6.35e-  4
## 14 CD69             3.63e-  4   8.04e-  4
## 15 CD8              5.99e-  4   1.24e-  3
## 16 CD4              2.55e-  2   4.95e-  2
summary(glmm_fit_all) %>% dplyr::filter(pvalues_adj < 0.05) 
## # A tibble: 16 x 3
##    protein_name pvalues_unadj pvalues_adj
##    <chr>                <dbl>       <dbl>
##  1 CD66             5.32e-282   1.65e-280
##  2 HLADR            1.56e- 40   2.42e- 39
##  3 CD107a           2.63e- 36   2.72e- 35
##  4 IFNg             1.37e- 25   1.06e- 24
##  5 CD56             3.24e- 24   2.01e- 23
##  6 CD45             3.66e- 19   1.89e- 18
##  7 CD11a            2.75e- 15   1.22e- 14
##  8 IL10             9.48e- 13   3.67e- 12
##  9 MIP1B            1.29e-  9   4.43e-  9
## 10 CD2              7.40e-  8   2.29e-  7
## 11 CD62L            5.85e-  6   1.65e-  5
## 12 IL4              1.26e-  4   3.26e-  4
## 13 CD3              2.66e-  4   6.35e-  4
## 14 CD69             3.63e-  4   8.04e-  4
## 15 CD8              5.99e-  4   1.24e-  3
## 16 CD4              2.55e-  2   4.95e-  2

Add CD66, CD107a, CD56, HLADR, CD45, IFNg, CD11a, MIP1B, IL10, CD8, CD2, CD62L, IL4, CD3 into one marker.

df_samples_all %<>% mutate(signif_sum = CD66+CD107a+CD56+HLADR+CD45+IFNg+CD11a+MIP1B+IL10+CD8+CD2+CD62L+IL4+CD3)
protein_names_sum_all = c(
  "signif_sum", 
  protein_names_all[!protein_names_all %in% c("CD66", "CD107a", "CD56", "HLADR", "CD45", "IFNg", "CD11a", "MIP1B", "IL10", "CD8", "CD2", "CD62L", "IL4", "CD3")]
)
glmm_fit_all = CytoGLMM::cytoglmm(df_samples_all, 
                              protein_names = protein_names_sum_all,
                              condition = "term", group = "donor",
                              cell_n_subsample = 3000,
                                num_cores = 1
                              )
## 2019-06-18 10:54:34 INFO:mbest.mhglm:Creating design matrix
## 2019-06-18 10:54:34 INFO:mbest.mhglm:Grouping factor and random effect design matrix
## 2019-06-18 10:54:34 INFO:mbest.mhglm:Setting weights
## 2019-06-18 10:54:34 INFO:mbest.mhglm:Setting offset
## 2019-06-18 10:54:34 INFO:mbest.mhglm:Fitting model
## 2019-06-18 10:54:34 INFO:mbest.mhglm.fit:Fitting models in parallel
## 2019-06-18 10:54:36 INFO:mbest.mhglm.fit:Computing mean and covariance estimates
## 2019-06-18 10:54:36 INFO:mbest.mhglm.fit:Estimating moments
## 2019-06-18 10:54:36 INFO:mbest.mhglm.fit:Computing mean estimate
## 2019-06-18 10:54:36 INFO:mbest.mhglm.fit:Computing covariance estimate
## 2019-06-18 10:54:36 INFO:mbest.mhglm.fit:Refining mean and covariance estimates
## 2019-06-18 10:54:36 INFO:mbest.mhglm.fit:Estimating moments
## 2019-06-18 10:54:36 INFO:mbest.mhglm.fit:Computing mean estimate
## 2019-06-18 10:54:36 INFO:mbest.mhglm.fit:Computing covariance estimate
## Warning in moment.est.cov.reducer(cov.info, diagcov, cov.rank.warn,
## cov.psd.warn): moment-based covariance matrix estimate is not positive
## semi-definite; using projection
plot(glmm_fit_all)

summary(glmm_fit_all) %>% dplyr::filter(pvalues_adj < 0.05)
## # A tibble: 12 x 3
##    protein_name pvalues_unadj pvalues_adj
##    <chr>                <dbl>       <dbl>
##  1 signif_sum        3.35e-19    6.03e-18
##  2 CD7               3.18e- 7    2.86e- 6
##  3 CD16              2.03e- 6    1.22e- 5
##  4 CCR5              5.86e- 6    2.64e- 5
##  5 CD69              3.91e- 5    1.18e- 4
##  6 CCR7              3.94e- 5    1.18e- 4
##  7 CD25              1.37e- 4    3.52e- 4
##  8 CD14              6.45e- 4    1.45e- 3
##  9 NKG2A             1.69e- 3    3.38e- 3
## 10 CD4               1.91e- 3    3.44e- 3
## 11 NKG2D             3.24e- 3    5.30e- 3
## 12 CXCR4             3.00e- 2    4.51e- 2

9.2 Generalized Linear Model with Bootstrap for all used proteins

Instead of modeling the donor effect, we can use bootstrap resampling. In our experience, this type of regression gives also good results when samples are not matched between conditions on the same donor.

glm_fit_all = CytoGLMM::cytoglm(df_samples_all, 
                            num_boot = 100,
                            protein_names = protein_names_all,
                            condition = "term", group = "donor",
                            cell_n_subsample = 3000, # Christof: to save memory
                            num_cores = 1 # Christof: to save memory
                            ) 
glm_fit_all
## 
## #######################
## ## paired analysis ####
## #######################
## 
## number of bootstrap samples: 100 
## 
## number of cells per group and condition:       
##         Post-prime Post-boost
##   BB078       3000       3000
##   BB231       3000       3000
##   BC641       3000       3000
##   BD620       3000       3000
## 
## proteins included in the analysis:
##  CD66 HLADR CD3 CD107a CD8 CD45 IL4 GranzymeB CD56 CD62L CD4 CD11a CD2 CD7 MIP1B TNFa Ki67 NKG2D CD11c CD69 IFNg CD25 CD16 CCR5 CXCR4 CD14 perforine NKG2A CD20 CCR7 IL10 
## 
## condition compared: term 
## grouping variable: donor
plot(glm_fit_all)

summary(glm_fit_all) %>% dplyr::filter(pvalues_adj < 0.05)
## # A tibble: 14 x 3
##    protein_name pvalues_unadj pvalues_adj
##    <chr>                <dbl>       <dbl>
##  1 CD107a                0.02      0.0443
##  2 CD11a                 0.02      0.0443
##  3 CD2                   0.02      0.0443
##  4 CD3                   0.02      0.0443
##  5 CD45                  0.02      0.0443
##  6 CD56                  0.02      0.0443
##  7 CD62L                 0.02      0.0443
##  8 CD66                  0.02      0.0443
##  9 CD69                  0.02      0.0443
## 10 HLADR                 0.02      0.0443
## 11 IFNg                  0.02      0.0443
## 12 IL10                  0.02      0.0443
## 13 IL4                   0.02      0.0443
## 14 MIP1B                 0.02      0.0443

With bootstrapping, the following protein markers are shown to be significantly different CD107a, CD11a, CD16, CD2, CD3, CD45, CD56, CD62L, CD66, CD8, HLADR, IFNg, IL10, IL4, MIP1B

9.3 Mixture of Regressions for all used proteins

Fit a mixture of regression model to identity clusters of donors or outliers. This function is a wrapper around the package flexmix [@grun2007fitting].

NOTE TO SELF: Cell_n_subsample = 5000 instead of 1000 CS set; num_cores = 2 –> takes a few seconds but doesn’t crash computer and get 3 cluster assignments instead of just 1.

num_donors_all = nlevels(as.factor(df_samples_all$donor))
mix_fit_all = CytoGLMM::cytoflexmix(df_samples_all, 
                                protein_names = protein_names_all,
                                condition = "term", group = "donor", 
                                ks = 1:num_donors_all,
                                cell_n_subsample = 6000, # Christof: to save memory
                                num_cores = 1 # Christof: to save memory
                                )
## 1 : * * * * *
## 2 : * * * * *
## 3 : * * * * *
## 4 : * * * * *
plot(mix_fit_all)

The plotting function automatically uses the BIC criterion to select the number of clusters. In this case, it picks 10 clusters.

plot_model_selection(mix_fit_all)

10 SummarizedExperiment

We create a SummarizedExperiment object containing marker, sample table, and untransformed protein counts. This way we can store all the information of this experiment in one file and load it again in subsequent analyses.

tb_marker %<>% dplyr::filter(type != "none")
d_combined = df_samples_all %>% 
  select(tb_marker$protein_name) %>%
  dplyr::mutate_all(.funs = trans_func) %>%
  dplyr::mutate_all(.funs = round) %>%
  as.matrix
row_data = df_samples_all %>% 
  select(sample_info_names) %>% 
  as.data.frame
col_data = tb_marker %>% as.data.frame
se_palgen2019vaccine = SummarizedExperiment(
  assays = list(exprs = d_combined),
  colData = col_data,
  rowData = row_data
)
save(se_palgen2019vaccine, file = "se_palgen2019vaccine.Rdata")

11 Discussion

Regarding the output of the generalized linear mixed model for all proteins. Fourteen proteins were found to be significantly different between prime and boost conditions: CD66, CD107a, CD56, HLADR, CD45, IFNg, CD11a, MIP1B, IL10, CD8, CD2, CD62L, IL4, CD3.

Without modeling the donor effect, bootstrapping can be done which also gives good results. The following protein markers are shown to be significantly different with this technique CD107a, CD11a, CD16, CD2, CD3, CD45, CD56, CD62L, CD66, CD8, HLADR, IFNg, IL10, IL4, MIP1B. Except for the additional CD16, all other protein markers found are identical.

In comparison, Palgen et al. (2018) found the following proteins to be significantly upregulated by Natural Killer cells in the boost condition: Granzyme B, CD107a, perforin, CD69, CD66, CCR5, CD11c, CD16, and, slightly less strongly, CD11a. They venture a number of possible explanations to these findings in their discussion section.

The overlap between the two results is relatively similar, both found the two following markers: CD66 and CD107a. For all other proteins, the other method did not find the same result. This high discrepancy may be due to a number of reasons. There are few donors and high variation between them. Different statistical methods were applied. Furthremore, I am only using a subset of the data.

Session Info

sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] hexbin_1.27.3               lmerTest_3.1-0             
##  [3] lme4_1.1-21                 Matrix_1.2-17              
##  [5] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
##  [7] BiocParallel_1.16.6         matrixStats_0.54.0         
##  [9] Biobase_2.42.0              GenomicRanges_1.34.0       
## [11] GenomeInfoDb_1.18.2         IRanges_2.16.0             
## [13] S4Vectors_0.20.1            BiocGenerics_0.28.0        
## [15] ggcorrplot_0.1.3            RColorBrewer_1.1-2         
## [17] scales_1.0.0                ggcyto_1.11.5              
## [19] openCyto_1.20.2             flowWorkspace_3.30.2       
## [21] ncdfFlow_2.28.1             BH_1.69.0-1                
## [23] RcppArmadillo_0.9.400.3.0   flowCore_1.48.1            
## [25] FlowRepositoryR_1.14.1      magrittr_1.5               
## [27] forcats_0.4.0               stringr_1.4.0              
## [29] dplyr_0.8.1                 purrr_0.3.2                
## [31] readr_1.3.1                 tidyr_0.8.3                
## [33] tibble_2.1.3                ggplot2_3.1.1              
## [35] tidyverse_1.2.1             CytoGLMM_0.1.0             
## [37] Rcpp_1.0.1                 
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1           backports_1.1.4        plyr_1.8.4            
##   [4] lazyeval_0.2.2         splines_3.5.3          fda_2.4.8             
##   [7] digest_0.6.19          foreach_1.4.4          htmltools_0.3.6       
##  [10] fansi_0.4.0            cluster_2.0.9          doParallel_1.0.14     
##  [13] ks_1.11.5              recipes_0.1.5          modelr_0.1.4          
##  [16] gower_0.2.1            R.utils_2.8.0          sandwich_2.5-1        
##  [19] strucchange_1.5-1      colorspace_1.4-1       rvest_0.3.4           
##  [22] rrcov_1.4-7            ggrepel_0.8.1          haven_2.1.0           
##  [25] xfun_0.7               crayon_1.3.4           RCurl_1.95-4.12       
##  [28] jsonlite_1.6           bigmemory.sri_0.1.3    graph_1.60.0          
##  [31] zeallot_0.1.0          survival_2.44-1.1      zoo_1.8-6             
##  [34] iterators_1.0.10       glue_1.3.1             flowClust_3.20.1      
##  [37] gtable_0.3.0           XVector_0.22.0         ipred_0.9-9           
##  [40] zlibbioc_1.28.0        IDPmisc_1.1.19         Rgraphviz_2.26.0      
##  [43] DEoptimR_1.0-8         abind_1.4-5            pheatmap_1.0.12       
##  [46] mvtnorm_1.0-10         clue_0.3-57            mclust_5.4.3          
##  [49] lava_1.6.5             prodlim_2018.04.18     httr_1.4.0            
##  [52] mbest_0.6              speedglm_0.3-2         modeltools_0.2-22     
##  [55] factoextra_1.0.5       pkgconfig_2.0.2        XML_3.98-1.19         
##  [58] R.methodsS3_1.7.1      flexmix_2.3-15         nnet_7.3-12           
##  [61] flowViz_1.46.1         utf8_1.1.4             caret_6.0-84          
##  [64] labeling_0.3           flowStats_3.40.1       tidyselect_0.2.5      
##  [67] rlang_0.3.4            reshape2_1.4.3         munsell_0.5.0         
##  [70] cellranger_1.1.0       tools_3.5.3            cli_1.1.0             
##  [73] generics_0.0.2         broom_0.5.2            evaluate_0.14         
##  [76] yaml_2.2.0             ModelMetrics_1.2.2     knitr_1.23            
##  [79] robustbase_0.93-5      RBGL_1.58.2            nlme_3.1-140          
##  [82] R.oo_1.22.0            xml2_1.2.0             compiler_3.5.3        
##  [85] rstudioapi_0.10        pcaPP_1.9-73           stringi_1.4.3         
##  [88] lattice_0.20-38        nloptr_1.2.1           vctrs_0.1.0           
##  [91] pillar_1.4.1           data.table_1.12.2      cowplot_0.9.4         
##  [94] bitops_1.0-6           corpcor_1.6.9          bigmemory_4.5.33      
##  [97] R6_2.4.0               latticeExtra_0.6-28    KernSmooth_2.23-15    
## [100] gridExtra_2.3          codetools_0.2-16       gtools_3.8.1          
## [103] boot_1.3-22            MASS_7.3-51.4          assertthat_0.2.1      
## [106] withr_2.1.2            mnormt_1.5-5           GenomeInfoDbData_1.2.0
## [109] hms_0.4.2              grid_3.5.3             rpart_4.1-15          
## [112] timeDate_3043.102      class_7.3-15           minqa_1.2.4           
## [115] rmarkdown_1.13         logging_0.9-107        ggpubr_0.2            
## [118] numDeriv_2016.8-1      lubridate_1.7.4        ellipse_0.4.1